Multiple Imputation of Missing Complex Survey Data using SAS®: A Brief Overview and An Example Based on the Research and Development Survey (RANDS)

Multiple imputation (MI) is a widely used analytic approach to address missing data problems. SAS® (SAS Institute Inc, Cary, N.C.) has established MI procedures including PROC MI and PROC MIANALYZE. We illustrate the use of these procedures for conducting MI analysis of complex survey data by an example from RANDS. Section 1 contains the introduction. Section 2 includes some necessary methodological background. Section 3 shows a MI example with an arbitrary missing data pattern. Section 4 concludes the paper with a discussion.


Introduction
Population-based studies often rely on surveys to collect information and conduct data analysis. However, survey data are often subject to nonresponse or missing data problems. Multiple imputation (MI) is arguably one of the most popular statistical strategies to handle missing data issues in many fields (Rubin 1987;He et al., 2022) including survey nonresponse problems.
The default option in statistical software is to remove cases with missing values from the analysis (i.e., case-deletion). The practicality of MI sits on its successful implementations in some mainstream software packages (e.g., SAS ® and R) so that practitioners can use straightforward programming statements to conduct the analysis. For example, Berglund and Heeringa (2014) provided an overview of MI and its applications, using SAS ® for illustration. Similar research literature can be found for other software packages. In addition, practitioners can refer to the software documentation for guidance. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 1 wdq7@cdc.gov .
Missing data problems in complex surveys pose some unique challenges (Section 2). For survey item nonresponse problems, MI has been proven to be a useful analytical tool supported by a large body of literature (e.g., Rubin 1987;He et al. 2022, Chapter. 10). However, most of the literature has focused on the technical aspects of MI and yet touched less on the programming components. In addition, the relevant programming literature and documentation are largely targeted to non-survey types of data.
To fill this gap, the aim of this paper is to provide a brief overview and a real example of MI for complex survey data using SAS ® programming statements (version 9.4; additionally, the users can also use the free cloud SAS platform on https://www.sas.com/en_us/software/ on-demand-for-academics.html).

Missing data mechanism
Briefly speaking, the missing data mechanism of an incomplete variable describes how the probability of its missingness (i.e., being missing) is related to the original data. In general, there are three types of missing data mechanisms: (1) Missing completely at random (MCAR): the missingness of a variable is not related to any variable in the data; (2) Missing at random (MAR): the missingness of a variable is only related to other fully-observed variables in the data; (3) Missing not at random (MNAR): the missingness of a variable is related to the missing values after controlling for other fully-observed variables.

Multiple Imputation
To conduct a MI analysis of a dataset, an appropriate missing data mechanism (e.g., MAR) is first assumed. Then a statistical imputation model is formulated to relate the missing variable(s) with observed variable(s) in the dataset. Next, missing values are imputed (i.e., replaced) by random draws from their posterior predictive distributions or their approximations derived from the imputation model. Such a procedure is independently repeated multiple (say M) times, resulting in M sets of imputed values. Early research (e.g., Rubin 1987) suggested setting M=5 is sufficient for regular analyses applied to datasets with a small or moderate amount of missing data. More recent research (e.g., He et al. 2022, Section 3.3.3) has shown that larger numbers (e.g., M > 5) might be desired when computing and data storage resources are available. After imputation, each of the M completed datasets, including both the observed and the imputed values, is analyzed separately and results in M sets of analysis results/estimates. Finally, these M sets of results are combined to yield a single set of statistical inference using the so-called Rubin's combining rules (Rubin 1987).

Multiple Imputation for Complex Survey Missing Data Problems
Most surveys are based on sample designs with one or more complex features such as stratification, clustering of sampled elements, and weighting to compensate for differential probabilities of sample inclusion or varying response rates. Therefore, it is essential to incorporate this design information for survey data analysis (Cochran 1977). Survey data analysis procedures accounting for the design information are readily available in SAS ® (Section 3).

He and Zhang
Page 2 Surv Stat. Author manuscript; available in PMC 2023 August 12.
The above principle also holds for analyzing multiply-imputed complex survey data. Additionally, a principled MI procedure for complex survey missing data problems should also include the design information in the imputation process. However, there exist alternative practical options for incorporating the sample design (e.g., He et al. 2022, Section 10.3). Here we outline a hierarchical, trial-and-error strategy:

1.
Include the survey weight as a variable (predictor) in the imputation;

2.
To include information about the sampling strata and clusters: (2.1) First, create a new categorical variable that combines the sampling strata and the nested clusters, and include this variable in the imputation; (2.2) If the imputation model has some estimation issues due to a large number of categories from the above combining variable, then collapse clusters within a sampling stratum for clusters with small sample sizes or only includes the sampling strata variable in the imputation; (2.3) If the model estimation issue still exists because some strata only have very few units then collapse these small-sample strata together to ensure each final stratum has a sufficient sample size, and then include the collapsed-strata variable in the imputation.
An additional major challenge for surveys is that missing data often happen for multiple variables, and this issue is usually coupled with another fact that survey variables are typically bounded. A feasible MI approach is the so-called "Fully Conditional Specification" (FCS) strategy, which imputes each incomplete variable based on a model that includes all other variables as the predictors and then cycles through all missing variables sequentially. FCS is arguably the most popular MI strategy for multivariate survey missing data problems (He et al. 2022, Chapter. 7).

Major SAS ® Procedures
The two main SAS ® procedures for MI are PROC MI and PROC MIANALYZE. Other SAS ® procedures and data steps are also often used, depending on the analytic goals and contexts. Here we outline five major programming stages in a typical MI analysis.
Stage 1 (processing): Processing data before imputation to construct the working dataset including both the target missing and fully-observed variables. Exploratory analyses are often conducted at this stage.
Stage 2 (imputation): Running imputation M times by applying PROC MI to the working dataset.
Stage 3 (analysis): Applying the planned (post-imputation) analysis to the completed datasets by running SAS ® statistical procedures. In the context of complex survey data, these procedures typically include PROC SURVEYMEANS, PROC SURVEYREG, etc. Stage 4 (combining): Combining the results to yield the final estimates with PROC MIANALYZE.
Stage 5 (evaluation): An evaluation analysis that typically compares results among different MI models and with the case-wise deletion method.

Data Background
The example is illustrated using a subset of Research and Development Survey (RANDS) (https://www.cdc.gov/nchs/rands/), a series of probability-sampled web-based surveys conducted by the National Center for Health Statistics (e.g., He et al, 2020). Specifically, we use some variables from the publicly released RANDS during COVID-19 data (the 3 rd round), which is a special series of RANDS used to rapidly report on the impact of the COVID-19 pandemic (Irimata and Scanlon, 2022). The original dataset contains 5,458 records; it can be downloaded from (https://www.cdc.gov/nchs/rands/data.htm). Table 1 briefly describes the variables used in the example.  If rnumber_INCOME < p_miss_INCOME then R_miss_INCOME =1; else R_miss_INCOME=0;

Sample Code and Output
If R_miss_INCOME = 1 then INCOME=.; run; In SAS ® , missing values are coded by "." (dot). In the code above, INCOME is set as missing if a uniform random number is less than a pre-specified missingness probability, which is related to other variables by a logit function. As outlined in Section 3.1, additional SAS ® data steps and exploratory analyses can be done for the data processing stage of the MI analysis. We provide some additional remarks about the above code.

a.
The input dataset is "rands_covid_missing"; the output dataset containing the multiple imputation results is "income_impute"; "nimpute=" specifies the number of imputations (we use 5 in this example); "seed=" specifies the initial random seed used in MI. Fixing the random seed can render reproducible results.

b.
The variables included in the imputation are specified after "var". Among them, categorical variables are specified after "class".

c.
To include the design variables, we initially include WEIGHT_CALIBRATED and the combined strata and PSU variable (S_VSTRAT and S_VPSU, respectively) in the model (after "var"). However, the model has estimation problems because some sampling strata have very few samples. As a result, SAS ® would issue warnings in log files. They would also be noticed by checking the regression coefficients of the output. Therefore, we collapse some small strata so that each final stratum has at least 10 samples, which is coded by the new variable S_VSTRAT_COMBINE. We also exclude S_VPSU from the model.

d.
fcs nbiter=20 reg (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment). This statement specifies that we use FCS to impute both INCOME and MARITAL_NEW. Specifically, "nbiter=20" specifies 20 iterations are to be used; "reg (INCOME/details)" specifies a linear regression model for INCOME, and the "details" option asks for outputting the regression coefficients of the model fit across all imputations; "logistic/details" specifies a logistic regression imputation model for MARITAL_NEW with coefficients output; "likelihood=augment" specifies a robust logistic regression to deal with possible data separation issues (e.g., He et al. 2022, Section 4.3.2.4).

e.
We can specify "min=1" and "max=16" after "proc" to force the imputed values of INCOME falling in this range. For the variables that do not need the bounds, their "min" and "max" are assigned as missing values.
This statement (commented out with a "*") specifies another modeling option: a PMM imputation for INCOME and a logistic regression imputation for MARITAL_NEW.

g.
fcs nbiter=20 reg (INCOME/details) discrim (MARITAL_NEW/classeffects =include details). This statement (commented out with a "*") specifies another modeling option: a linear normal imputation for INCOME and a discriminant analysis model for MARITAL_NEW. For the latter, "classeffects=include" specifies that all of the remaining variables, both continuous and categorical, are included in the discriminant analysis.
h. fcs nbiter=20 regpmm (INCOME/details) discrim (MARITAL_NEW/classeffects =include details). This statement (commented out with a "*") specifies another modeling option: a PMM imputation for INCOME and a discriminant analysis model for MARITAL_NEW. We now include some output from the above code and provide remarks. For ease of illustration, we separate the output into four parts and then comment on them one by one.

Output 1
The MI Procedure

Model Information
Data Set WORK.RANDS_COVID_MISSING

Number of Burn-in Iterations 20
Seed for random number generator 197789

Discriminant Function EDUC GENDER INTERNET S_VSTRAT_COMBINE
Output 1 provides some general information about the imputation model setup and the variables included. For categorical variables, the discriminant analysis imputation model is the default option.
Output 2 shows the missingness pattern of the variables and some descriptive statistics of the associated subgroups. Specifically, Group 1 has all variables fully observed, denoted by 'X" for each variable; Group 2 has only MARITAL_NEW with missing values (denoted by "."); Group 3 has only INCOME with missing values; and Group 4 has missing values on both INCOME and MARITAL_NEW. The means of the continuous variables of each subgroup are also displayed. For instance, the average age from Group 1 (=53.386) is higher than those from the other three groups.
Output 2 also shows that the data have an arbitrary missing data pattern. On the opposite, a monotone missingness pattern is usually seen in longitudinal studies where once a subject drops out, his/her measurements at later times are always missing. Note that PROC MI has He and Zhang Page 7 Surv Stat. Author manuscript; available in PMC 2023 August 12. specific options for imputing monotone missing data. However, for brevity, they are not covered in this paper.

Output 3
Regression Output 3 shows some details about the fit for each of the imputation models used in FCS. If we use the modeling option "fcs nbiter=20 reg (INCOME/details) logistic (MARITAL_NEW/details likelihood=augment)" in PROC MI, then the output contains the linear regression coefficients for INCOME and logistic regression coefficients for MARITAL_NEW across 5 imputations. For simplicity we do not include all coefficients here. Specifically, the results under "Regression Models for FCS Method" lists the coefficients for fitting INCOME. For example, the coefficient for AGE is 0.020476 for the 1 st imputation, 0.029064 for the 2 nd imputation, etc. The results under "Logistic Models for FCS Method" lists the coefficients for fitting MARITAL_NEW. For instance, the coefficient for AGE is −0.524000 for the 1 st imputation, −0.520803 for the 2 nd imputation, etc.
We previously discussed the need for collapsing some small strata and excluding clusters to achieve stable model estimates. If this was not implemented, in addition to seeing warning statements from SAS ® log files, we would also see some very extreme logistic regression coefficients (e.g., outside the range [−5,5]) in Output 3.

Output 4
He For illustration, we estimate the overall mean of INCOME and MARITAL_NEW using PROC SURVEYMEANS, which uses the survey design information including strata, clusters, and weights. The working dataset "data=income_impute" reads the output dataset from PROC MI. In that dataset, a variable "_imputation_" is used to label the number of imputations (i.e., 1-5), and the dataset has 27,290 (=5458×5) records. A "by" option is used to run the analyses separately. Finally, the "ods output statistics = mean_income_imp" is used to store the output of the 5 analyses in the dataset "mean_income_imp" for carrying out the combining step in Stage 4. The procedure reads in the dataset mean_income_imp, which contains the separate estimates from the multiply-imputed datasets. The option "EDF= " is not the default but necessary for complex survey data analysis because it specifies the degrees of freedom in the combining step. In this example, we specify the degrees of freedom as the number of clusters minus the number of strata in the dataset. The statement "modeleffects mean" specifies that the estimand for combining is the mean estimates. The statement "stderr stderr" lists standard errors associated with the means. "where varname = 'INCOME'" indicates that the combining step only applies to INCOME. Finally, "ods output parameterestimates=MI_results_income" saves the combined estimates to the dataset MI_results_income.

Output 6
The MIANALYZE Procedure Output 6 shows the results from PROC MIANALYZE. The combined mean estimate of INCOME is 10.230448, its standard error is 0.120219, and the 95% confidence limits are (9.991212, 10.46968). Detailed explanations of other statistics (e.g., between/within variance) can be found in the literature (e.g., He et al. 2022, Chapter. 3).
Stage 5: We conduct some diagnostics and evaluation. We have considered different modeling options for INCOME and MARITAL_NEW (Section 3.2.2). In this example, since we create the missing values, the imputation analysis results can also be compared with those from complete data as well as from the case-deletion method. The programming code for Stage 5 would be running different MI models and analyses (e.g., remark (d)-(h) after PROC MI in Section 3.3). Omitting the details, the evaluation results are summarized in Table 2.
The mean estimates from the case-deletion are considerably lower than the complete-data analysis due to MAR. In general, all MI methods correct for the biases somewhat. In addition, MI analyses yield generally narrower confidence intervals than the case-deletion method. Among different MI methods applied, it seems that when INCOME is imputed via PMM, the corresponding results are the closest to the complete-data analysis for both variables. Therefore, we would choose PMM+logit as the final MI modeling option.